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ABSTRACT 



We present libsharp, a code library for spherical harmonic transforms (SHTs), which evolved from the libpsht library and ad- 
dresses several of its shortcomings, such as adding MPI support for distributed memory systems and SHTs of fields with arbitrary 
spin, but also supporting new developments in CPU instruction sets like the Advanced Vector Extensions (AVX) or fused multiply- 
accumulate (FMA) instructions. The library is implemented in portable C99 and provides an interface that can be easily accessed from 
other programming languages such as C++, Fortran, Python, etc. Generally, libsharp's performance is at least on par with that of its 
predecessor; however, significant improvements were made to the algorithms for scalar SHTs, which are roughly twice as fast when 
using the same CPU capabilities. The library is available at http://sourceforge.net/projects/libsharp/ under the terms of 
the GNU General Public License. 
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1. Motivation 



While the original libpsht library presented by Reinecke 
d201 lb fulfilled most requirements on an implementation of 
spherical harmonic transforms (SHTs) in the astrophysical con- 
text at the time, it still left several points unaddressed. Some of 
those were already mentioned in the original publication: sup- 
port for SHTs of arbitrary spins and parallelisation on computers 
with distributed memory. 

Both of these features have been added to libpsht in the 
meantime, but other, more technical, shortcomings of the library 
have become obvious since its publication, which could not be 
fixed within the libpsht framework. 

One of these complications is that the library design did 
not anticipate the rapid evolution of microprocessors during the 
past few years. While the code supports both traditional scalar 
arithmetic as well as SSE2 instructions, adding support for the 
newly released Advanced Vector Extensions (AVX) and fused 
multiply-accumulate instructions (FMA3/FMA4) would require 
adding significant amounts of new code to the library, which is 
inconvenient and very likely to become a maintenance burden in 
the long run. Using proper abstraction techniques, adding a new 
set of CPU instructions could be achieved by only very small 
changes to the code, but the need for this was unfortunately not 
anticipated when libpsht was written. 

Also, several new, highly efficient SHT implementations 
have been publis hed in the mea ntime; most nota bly Wavemoth 
(Seljebotn 2012) and shtns (ISchaefferl l20H . These codes 
demonstrate that libpsht's computational core did not make 
the best possible use of the available CPU resources. Note that 
Wavemoth is currently an experimental research code not meant 
for general use. 

To address both of these concerns, the library was redesigned 
from scratch. The internal changes also led to a small loss of 
functionality; the new code no longer supports multiple simul- 



taneous SHTs of different type (i.e. having different directions 
or different spins). Simultaneous transforms of identical type are 
still available, however. 

As a fortunate consequence of this slight reduction in func- 
tionality, the user interface could be simplified dramatically, 
which is especially helpful when interfacing the library with 
other programming languages. 

Since backward compatibility is lost, the new name 
libsharp was chosen for the resulting code, as a shorthand for 
"Spherical HARmonic Package". 

The decision to develop libsharp instead of simply using 
shtns was taken for various reasons: shtns does not support 
spin SHTs or allow MPI parallelisation, it requires more main 
memory than libsharp, which can be problematic for high- 
resolution runs, and it relies on the presence of the FFTW li- 
brary. Also, shtns uses a syntax for expressing SIMD opera- 
tions, which is cmrently only supported by the gcc and clang 
compilers, thereby limiting its portability at least for the vec- 
torised version. Finally, libsharp has support for partial spher- 
ical coverage and a wide variety of spherical grids, including 
Gauss-Legendre, ECP, and HEALPix. 



2. Problem definition 

This secti on conta i ns a q uick recapitulation of equations pre- 
sented in iReineckd d201lli . for easier reference. 

We assume a spherical grid with Ng- iso-latitude rings (in- 
dexed by y). Each of these in turn consists of N Vt y pixels (indexed 
by x), which are equally spaced in <p, the azimuth of the first ring 
pixel being ifoj. 

A continuous spin-.? function defined on the sphere with a 
spectral band limit of Z max can be represented either as a set of 
spherical harmonic coefficients s ai m , or a set of pixels p Ky . These 
two representations are connected by spherical harmonic synthe- 
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sis (or backward SHT): 

tnn 4™, 
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and spherical harmonic analysis (or forward SHT): 

Ng-l N„-l 



(1) 



Zv-i / 2nimx\ 

> h\, exp -wi^doj, - — — , (2) 

y=0 ,v=0 V / 



where s Ai, n ( ff) := s Yi m (§, 0) and w y are quadrature weights. 
Both transforms can be subdivided into two stages: 
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Eqs. |3] and [6] can be computed using fast Fourier transforms 
(FFTs), while Eqs.|4]and|5] which represent the bulk of the com- 
putational load, are the main target for optimised implementa- 
tion in libsharp. 

3. Technical improvements 

3. 1 . General remarks 

As the implementation language for the new library, ISO C99 
was chosen. This version of the C language standard is more 
flexible than the C89 one adopted for libpsht and has gained 
ubiquitous compiler support by now. Most notably, C99 allows 
definition of new variables anywhere in the code, which im- 
proves readability and eliminates a common source of program- 
ming mistakes. It also provides native data types for complex 
numbers, which allows for a more concise notation in many 
places. However, special care must be taken not to make use of 
these data types in the library's public interface, since this would 
prevent interoperability with C++ codes (because C++ has a dif- 
ferent approach to complex number support). Fortunately, this 
drawback can be worked around fairly easily. 

A new approach was required for dealing with the growing 
variety of instruction sets for arithmetic operations, such as tradi- 
tional scalar instructions, SSE2 and AVX. Rewriting the library 
core for each of these alternatives would be cumbersome and 
error-prone. Instead we introduced the concept of a generic "vec- 
tor" type containing a number of double-precision IEEE val- 
ues and defined a set of abstract operations (basic arithmetics, 
negation, absolute value, multiply-accumulate, min/max, com- 
parison, masking and blending) for this type. Depending on the 
concrete instruction set used when compiling the code, these op- 
erations are then expressed by means of the appropriate opera- 
tors and intrinsic function calls. The only constraint on the num- 
ber of values in the vector type is that it has to be a multiple of 
the underlying instruction set's native vector length (1 for scalar 
arithmetic, 2 for SSE2, 4 for AVX). 



for b = <all submaps or "blocks"> 




for m = [0;mmax] 


/ / OpenMP 


for 1 = [m;lmax] 




precompute recursion coefficients 




end 1 




for y = <all rings in submap b> 


// SSE/AVX 


for 1 = [m;lmax] 




compute s_lambda_lm(theta_y) 




for j = <all jobs> 




accumulate F(m,theta_y, j) 




end j 




end 1 




end y 




end m 




for y = <all rings in submap b> 


// OpenMP 


for j = <all jobs> 




compute map(x,y,j) using FFT 




end j 




end y 




end b 




Fig. 1. Schematic loop structure of libsharp's shared-memory 


synthesis code. 




for b = <all submaps or "blocks"> 




for y = <all rings in submap b> 


// OpenMP 


for j = <all jobs> 




compute G(m, theta_y , j) using FFT 




end j 




end y 




for m = [0;mmax] 


// OpenMP 


for 1 = [m;lmax] 




precompute recursion coefficients 




end 1 




for y = <all rings in submap b> 


// SSE/AVX 


for 1 = [m;lmax] 




compute s_lambda_lm(theta_y) 




for j = <all jobs> 




compute contribution to a_lm(j) 




end j 




end 1 




end y 




end m 




end b 





Fig. 2. Schematic loop structure of libsharp's shared-memory 
analysis code. 



Using this technique, adding support for new vector instruc- 
tion sets is straightforward and carries little risk of breaking ex- 
isting code. As a concrete example, support for the FMA4 in- 
structions present in AMD's Bulldozer CPUs was added and 
successfully tested in less than an hour. 



3.2. Improved loop structure 

After publication of SHT implementations, which perform sig- 
nificantly better than libpsht, e specially for s = transforms 
(ISeliebotnll20il ISchaefferil2013h ." it became obvious that some 
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bottleneck must be present in libpsht's implementation. This 
was identified with libpsht's approach of first computing a 
whole Z-vector of s ^im(^) in one go, storing it to main mem- 
ory, and afterwards re-reading it sequentially whenever needed. 
While the Z-vectors are small enough to fit into the CPU's Level- 
1 cache, the store and load operations nevertheless caused some 
latency. For 5 = transforms with their comparatively low arith- 
metic operation count (compared to the amount of memory ac- 
cesses), this latency could not be hidden behind floating point 
operations and so resulted in a slow-down. This is the most likely 
explanation for the observation that libpsht's 5 = SHTs have 
a much lower FLOP rate compared to those with 5^0. 

It is possible to avoid the store/load overhead for the s Ai m (-&) 
by applying each value immediately after it has been computed, 
and discarding it as soon as it is not needed any more. This ap- 
proach is reflected in the loop structure shown in Figs. Q~]and 
[31 which differs from the one in iReineckel (1201 lb mainly by the 
fusion of the central loops over I. 

In this context another question must be addressed: the loops 
marked as "SSE/AVX" in both figures are meant to be executed 
in "blocks", i.e. by processing several y indices simultaneously. 
The block size is equivalent to the size of the generic vector 
type described in Sect. 13.11 The best value for this parameter 
depends on hardware characteristics of the underlying computer 
and therefore cannot be determined a priori. Libsharp always 
uses a multiple of the system's natural vector length and esti- 
mates the best value by running quick measurements whenever 
a specific SHT is invoked for the first time. This auto-tuning step 
approximately takes a tenth of a wall-clock second. 

Due to the changed central loop of the SHT implementation, 
it is no longer straightforward to support multiple simultaneous 
transforms with differing spins and/or directions, as libpsht 
did - this would lead to a combinatorial explosion of loop bodies 
that have to be implemented. Consequently, libsharp, while 
still supporting simultaneous SHTs, restricts them to have the 
same spin and direction. Fortunately, this is a very common case 
in many application scenarios. 

3.3. Polar optimisation 

As previously mentioned in iReinec kel (l2011l) . much CPU time 
can be saved by simply not computing terms in Eqs. (|4]i and 
(O for which s Ai m ("ff) is so small that their contribution to the 
result lies well below the numerical accuracy. Since this situation 
occurs for rings lying close to the poles and high values of to, 
iSchaefferf (120131) referred to it as "polar optimisation". 

To determine which t erms can be omitted, libsh arp uses 
the approach described in Prezea u & Reineckd (1201 Ob . In short, 
all terms for which 



Vto 2 + s 2 - 2ms cos # - l max sin ■& > T 



(7) 



are skipped. The parameter T is tunable and determines the over- 
all accuracy of the result. Libsharp models it as 



T = max(100,0.01/ max ), 



(8) 



which has been verified to produce results equivalent to those of 
SHTs without polar optimisation. 

4. Added functionality 

4. 1 . SHTs with arbitrary spin 

While the most widely used SHTs in cosmology are performed 
on quantities of spins and 2 (i.e. sky maps of Stokes I and 



Q/U parameters), there is also a need for transforms at other 
spins. Lensing comp utations requ ire SHTs of spin-1 and spin-3 
quantities (see, e.g.. lLewisll2005l) . The most important motiva- 
tion f or high-spin SHTs, howe ver, are all-sky convol ution codes 
(e.g. IWandelt & Gorskil 120011; IPrezeau & Reineckd 120101) an d 
deconvolution map-makers (e.g. [Keihanen & Reinecke 20121) . 
The computational cost of these algorithms is dominated by cal- 
culating expressions of the form 



■ illH.X 



(9) 



/=max(|m|,|A|) 



where a and b denote two sets of spherical harmonic coeffi- 
cients (typically of the sky and a beam pattern) and d are the 
Wigner d matrix elements. These expressions can be interpreted 
and solved efficiently as a set of (slightly modified) SHTs with 
spins ranging from to £ max < Z max , which in today's applica- 
tions can take on values much highe r than 2. 

As was discussed in iReineckd d201 ll) . the algorithms used 
by libpsht for spin-1 and spin-2 SHTs become inefficient 
and inaccurate for higher spins. To support such transforms in 
libsharp, another approach was therefore implemented, which 
is based on the recursion for W igner d matrix elements presented 
in IPrezeau & Reineckd (12010I) . 

Generally, the spin-weighted spherical harmonics are related 
to the Wigner d matrix elements via 



s A lm (§) = (-1)" 



21 + 1 

47T 



(10) 



dGoldberg et al.lfl967l) . It is possible to compute the d l mm ,(&) us- 
ing a three-term recursion in / very similar to that for the scalar 



cos § - 



1(1 + 1) 



y/(l 2 - m 2 )(l 2 



l(2l+\) 



cm 



Vgj + l) 2 -m 2 ][(l + \)2 - m > 2 ] 
+ (l+l)(2l+l) d ™' (§) (U) 

(iKostelec & Rockmordl2008f) . The terms depending only on /, 
to, and to' can be re-used for different colatitudes, so that the 
real cost of a recursion step is two additions and three multipli- 
cations. 

In contrast to the statements made in iMcEwen & Wiauxl 

(1201 lh . this recursion is numerically stable when performed in 
the direction of increasing I; see, e.g., Sect. 15.1.21 for a practical 
confirmation. It is necessary, however, to use a digital floating- 
point representation with enhanced exponent range to avoid un- 
derfl ow during the recursio n, as is discussed in some detail in 
IPrezeau & Reineckd (1201 Ob . 



4.2. Distributed memory parallelisation 

When considering that, in current research, the required band 
limit for SHTs practically never exceeds Z max = 10 4 , it seems 
at first glance unnecessary to provide an implementation sup- 
porting multiple nodes. Such SHTs fit easily into the memory 
of a single typical compute node and are carried out within a 
few seconds of wall clock time. The need for additional paral- 
lelisation becomes apparent, however, as soon as the SHT is no 
longer considered in isolation, but as a (potentially small) part 
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for m = <all local m> 


// 


OpenMP 


for 1 = [m;lmax] 






precompute recursion coefficients 




end 1 






for y = <all rings in the map> 


// 


SSE/AVX 


for 1 = [mjlmax] 






compute s_lambda_lm(theta_y) 






for j = <all jobs> 






a rniTniil ^ 1" p Ffm theta v 

O. l_ LULL LI .1. CI L. <3 1 \_1LI , LI1C L CI V , 1 J 






end j 












end y 






end m 






rearrange F(m,theta_y, j) among tasks 


// MPI 


for y = <all local rings> 


// 


OpenMP 


for j = <all jobs> 






compute map(x,y,j) using FFT 






end j 






end y 







Fig. 3. Schematic loop structure of libsharp's distributed- 
memory synthesis code. 



Fig. 4. Schematic loop structure of libsharp's distributed- 
memory analysis code. 



of another algorithm, which is libsharp's main usage scenario. 
In such a situation, large amounts of memory may be occupied 
by data sets unrelated to the SHT, therefore requiring distribu- 
tion over multiple nodes. Moreover, there is sometimes the need 
for very many SHTs in sequence, e.g. if they are part of a sam- 
pling process or an iterative solver. Here, the parallelisation to 
a very large number of CPUs may be the only way of reduc- 
ing the time-to-solution to acceptable levels. Illustrative e xam- 
ples for this are the Commander code (Eriksen et al. 2008 ]) and 
the a rtDeco deconvolution mapmaker (Keihanen & Reinecke] 
120121) : for the processing of high-resolution Planck data, the lat- 



ter is expected to require over 100GB of memory and several 
hundred CPU cores. 

Libsharp provides an interface that allows collective execu- 
tion of SHTs on multiple machines with distributed memory. It 
makes use of the MPf[] interface to perform the necessary inter- 
process communication. 

In contrast to the standard, shared-memory algorithms, it 
is the responsibility of the library user to distribute map data 
and o/ m over the individual computers in a way that ensures 
proper load balancing. A very straightforward and reliable way 
to achieve this is a "round robin" strategy: assuming comput- 
ing nodes, the map is distributed such that node i hosts the map 
rings i, i+N, i+2N, etc. (and their counterparts on the other hemi- 
sphere). Similarly, for the spherical harmonic coefficients, node 
i would hold all ct\ m for m = i, i + N, i + 2N, etc. Other efficient 
distribution strategies do of course exist and may be advanta- 
geous, depending on the circumstances under which libsharp 
is called; the library makes no restrictions in this respect. 

The SHT algorithm for distributed memory architectures is 
analogous to the one used i n the S 2 HAT packagtQ and first 
published in Szvdlarski et alj d2013l) : its structure is sketched 
in Figs. [3] and [4] In addition to the S 2 HAT implementation, 
the SHT will be broken down into smaller chunks if the av- 
erage number of map rings per MPI task exceeds a certain 
threshold. This is analogous to the use of chunks in the scalar 
and OpenMP-parallel implementations and reduces the memory 
overhead caused by temporary variables. 

It should be noted that libsharp supports hybrid MPI and 
OpenMP parallelisation, which allows, e.g., running an SHT on 
eight nodes with four CPU cores each, by specifying eight MPI 
tasks, each of them consisting of four OpenMP threads. In gen- 
eral, OpenMP should be preferred over MPI whenever shared 
memory is available (i.e. at the computing node level), since the 
OpenMP algorithms contain dynamic load balancing and have a 
smaller communication overhead. 

4.3. Map synthesis of first derivatives 

Generating maps of first derivatives from a set of a\ m is closely 
related to performing an SHT of spin 1 . A specialised SHT mode 
was added to libsharp for this purpose; it takes as input a 
set of spin-0 ai,„ and produces two maps containing d f/d§ and 
df/(d(p sin §), respectively. 

4.4. Support for additional spherical grids 

Direct support for certain classes of spherical grids has been ex- 
tended in comparison to libpsht; these additions are listed be- 
low in detail. It must be stressed, however, that libsharp can - 
very much as libpsht does - perform SHTs on any iso-latitude 
grid with equidistant pixels on each ring. This very general class 
of pixelisations includes, e.g., certain types of partial spherical 
coverage. For these general grids, however, the user is responsi- 
ble for providing correct quadrature weights when performing a 
spherical harmonic analysis. 

4.4.1 . Extended support for ECP grids 

Libpsht provides explicit support for HEALPix grids, Gauss- 
Legendre grids, and a subset of equidistant cylindrical projection 



1 http://en.wikipedia.org/wiki/Message_Passing_Interface 

2 http : //code . google . com/p/s2hat- library/ 



for y = <all local rings> // OpenMP 

for j = <all jobs> 

compute G(m,theta_y, j) using FFT 

end j 
end y 

rearrange G(m,theta_y, j) among tasks // MPI 

for m = <all local m> // OpenMP 

for 1 = [m;lmax] 

precompute recursion coefficients 
end 1 

for y = <all rings in the map> // SSE/AVX 
for 1 = [mjlmax] 

compute s_lambda_lm(theta_y) 
for j = <all jobs> 

compute contribution to a_lm(j) 
end j 
end 1 
end y 
end m 
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(ECP) grids. The latter are limited to an even number of rings at 
the colatitudes 



*„=«i±P* ne^N-l]. 



(12) 



The associa t ed quadrature w eights are given by Fejer's first rule 
dFeieril 1933b iGautschill 1 967h . 

Libsharp extends ECP grid support to allow even and odd 
numbers of rings, as well as the colatitude distributions 



= — , ne[l;N-l] 
(corresponding to Fejer's second rule), and 



#„ = — , ne[0;N] 



(13) 



(14) 



(corresponding to Clenshaw-Curtis qua drature). This last pixeli- 
sation is identical to the one adopted in Huff enberger & Wandelj 
d2010h . 

Accurate computation of the quadrature weights for these 
pixelisations is non trivial; lib sharp adopts the FFT-based ap- 
proach described in Waldvogel (2006) for this puipose. 

4.4.2. Reduced Gauss-Legendre grid 

The polar optimisation described in Sect. 13. 3 1 implies that it is 
possible to reduce the number of pixels per ring below the the- 
oretically required value of 2Z max + 1 close to the poles. Eq. (O 
can be solved for m (at a given s, Z max and &), and using 2m + 1 
equidistant pixels in the corresponding map ring results in a pix- 
elisation that can represent a band-limited function up to the de- 
sired precision, although it is no longer exact in a mathemati- 
cal sense. If this number is further increased to the next number 
composed entirely of small prime factors (2, 3, and 5 are used in 
libsharp's case), this has the additional advantage of allowing 
very efficient FFTs. 

Libsharp supports this pixel reduction technique in the 
form of a thinned-out Gauss-Legendre grid. At moderate to high 
resolutions (/ max > 1000), more than 30% of pixels can be saved, 
which can be significant in various applications. 

It should be noted that working with reduced Gauss- 
Legendre grids, while saving considerable amounts of mem- 
ory, does not change SHT execution times significantly; all po- 
tential savings are already taken into account, for all grids, by 
libsharp's implementation of polar optimisation. 

4.5. Adjoint and real SHTs 

Since Eq. ([TJ is a linear transform, we can introduce the notation 

p = Ya (15) 

for a vector a of spherical harmonic coefficients and correspond- 
ing vector p of pixels. Similarly, one can write Eq. (0 as 



a = Y r Wp, 



where W is a diagonal matrix of quadrature weights. When in- 
cluding SHTs as operators in linear systems, one will often need 
the adjoint spherical harmonic synthesis, \ T , and the adjoint 
spherical harmonic analysis, WY. For instance, if a is a ran- 
dom vector with covariance matrix C in the spherical harmonic 
domain, then its pixel representation Ya has the covariance ma- 
trix YCY r . Multiplication by this matrix requires the use of the 



adjoint synthesis, which corresponds to analysis with a differ- 
ent choice of weights. Libsharp includes routines for adjoint 
SHTs, which is more user-friendly than having to compensate 
for the wrong choice of weights in user code, and also avoids an 
extra pass over the data. 

For linear algebra computations, the vector a must also in- 
clude o/ m with m < 0, even if libsharp will only compute 
the coefficients for m > 0. The use of the real spherical har- 
monics convention is a convenient way to include negative m 
without increasing the computational workload by duplicating 
all coefficients. For the definition we refer to the appendix of 
Ide Oliveira-Costa et al.l d2004l) . The convention is supported di- 
rectly in libsharp, although with a restriction in the storage 
scheme: The coefficients for m < must be stored in the same 
locations as the corresponding imaginary parts of the complex 
coefficients, so that the pattern in memory is [a^ m , ai- m ], 

5. Evaluation 

Most tests were performed on the SuperMUC Petascale System 
located at the Leibniz-Rechenzentrum Garching. This system 
consists of nodes containing 32GB of main memory and 16 Intel 
Xeon E5-2680 cores running at 2.7GHz. The exception is the 
comparison with Wavemoth, which was performed on the Abel 
cluster at the University of Oslo on very similar hardware; Xeon 
E5-2670 at 2.6 GHz. 

The code was compiled with gcc version 4.7.2. The Intel 
compiler (version 12.1.6) was also tested, but produced slightly 
inferior code. 

Except where noted otherwise, test calculations were per- 
formed using the reduced Gauss-Legendre grid (see Sect. 14.4.21 ) 
to represent spherical map data. This was done for the pragmatic 
purpose of minimising the tests' memory usage, which allowed 
going to higher band limits in some cases, as well as to demon- 
strate the viability of this pixelisation. 

The band limits adopted for the tests all obey / max = 2" - 1 
with n e N (except for those presented in Sect. 15.2.2b . This is 
done in analogy to most other papers on the subject, but leads to 
some unfamiliar numbers especially at very high Z max . 

The number of cores used for any particular run always is a 
power of 2. 

5. 1 . Accuracy tests 

5.1.1. Comparison with other implementations 

The numerical equivalence of libsharp's SHTs to existing im- 
plementations was verified by running spherical harmonic syn- 
thesis transforms on a Gauss-Legendre grid at Z max = 50 and 
spins 0, 1, and 2 with both libsharp and libpsht, and com- 
paring the results. The differences of the results lay well within 
the expected levels of numerical errors caused by the finite preci- 
sion of IEEE numbers. Spherical harmonic analysis is implicitly 
tested by the experiments in the following sections. 



(!6) 5.1 .2. Evaluation of SHT pairs 



To test the accuracy of libsharp's transforms, sets of spin=0 
ai m coefficients were generated by setting their real and imagi- 
nary parts to numbers drawn from a uniform random distribu- 
tion in the range [-1;1[ (with exception of the imaginary parts 
for m = 0, which have to be zero for symmetry reasons). This 
data set was transformed onto a reduced Gauss-Legendre grid 
and back to spherical harmonic space again, resulting in S/ m . 
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Fig. 5. Maximum and rms errors for inverse/forward spin=0 
SHT pairs at different / max . 



Fig. 6. Strong scaling scenario: accumulated wall-clock time for 
a spin=2 SHT pair with / raax = 16383 run on various numbers of 
cores. 



The rms and maximum errors of this inverse/forward trans- 
form pair can be written as 



and 



max \ s ai„ 

lm 



(17) 
(18) 



Fig. [5] shows the measured errors for a wide range of band 
limits. As expected, the numbers are close to the accuracy limit 
of double precision IEEE numbers for low / max ; rms errors in- 
crease roughly linearly with the band limit, while the maximum 

3/2 

error seems to exhibit an l^ ax scaling. Even at / max = 262143 
(which is extremely high compared to values typically required 
in cosmology), the errors are still negligible compared with the 
uncertainties in the input data in today's experiments. 

Analogous experiments were performed for spins 2 and 37, 
with very similar results (not shown). 




Fig. 7. Weak scaling scenario: wall-clock time for a spin=0 SHT 
pair run on various numbers of cores, with constant amount of 
work per core (/ max (A0 = 4096 ■ JV I/3 - 1). 



5.2. Performance tests 

Determining reliable execution times for SHTs is nontrivial at 
low band limits, since intermittent operating system activity can 
significantly distort the measurement of short time scales. All 
libsharp timings shown in the following sections were ob- 
tained using the following procedure: the SHT pair in question 
is executed repeatedly until the accumulated wall-clock time ex- 
ceeds 2 seconds. Then the shortest measured wall-clock time for 
synthesis and analysis is selected from the available set. 

5.2.1. Strong scaling test 

To assess strong scaling behaviour (i.e. run time scaling for a 
given problem with fixed total workload), a spin=2 SHT with 
/max=16383 was carried out with differing degrees of parallelisa- 
tion. The accumulated wall-clock time of these transforms (syn- 
thesis + analysis) is shown in Fig. [6] It is evident that the scaling 
is nearly ideal up to 16 cores, which implies that parallelisation 
overhead is negligible in this range. Beyond 16 cores, MPI com- 
munication has to be used for inter-node communication, and 
this most likely accounts for the sudden jump in accumulated 
time. At even higher core counts, linear scaling is again reached, 
although with a poorer proportionality factor than in the intra- 
node case. Finally, for 1024 and more cores, the communication 
time dominates the actual computation, and scalability is lost. 



5.2.2. Weak scaling test 

Weak scaling behaviour of the algorithm is investigated by 
choosing problem sizes that keep the total work per core con- 
stant, in contrast to a fixed total workload. Assuming an SHT 
complexity of <9(/ max ), the band limits were derived from the 
employed number of cores N via l max {N) = 4096 • N 1 ^ - 1. The 
results are shown in Fig. [7] Ideal scaling corresponds to a hori- 
zontal line. Again, the transition from one to several computing 
nodes degrades performance, whereas scaling on a single node, 
as well as in the multi-node range, is very good. By keeping the 
amount of work per core constant, the breakdown of scalabil- 
ity is shifted to 4096 cores, compared with 1024 in the strong 
scaling test. 

It is interesting to note that the scaling within a single node 
is actually slightly superlinear; this is most likely because in this 
setup, the amount of memory per core decreases with increasing 
problem size, which in turn can improve the amount of cache 
re-use and reduce memory bandwidth per core. 

5.2.3. General scaling and efficiency 

The preceding two sections did not cover cases with small SHTs. 
This scenario is interesting, however, since in the limit of small 
/max those components of the SHT implementation with com- 
plexities lower than <9(/ max ) (like the FFT steps of Eqs. [3] and 
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Fig. 8. Accumulated wall-clock time for spin=0 and spin=2 SHT 
pairs at a wide range of different band limits. For every run the 
number of cores was chosen sufficiently small to keep paralleli- 
sation overhead low. 
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Fig. 10. Relative speed-up when performing several SHTs simul- 
taneously, compared with sequential execution. The SHT had a 
band limit of / max = 8191. 
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Fig. 9. Fraction of theoretical peak performance reached by var- 
ious SHT pairs. For every run the number of cores was chosen 
sufficiently small to keep parallelisation overhead low. 



Fig. 11. Relative memory overhead, i.e. the fraction of total 
memory that is not occupied by input and output data of the SHT. 
For low l max this is dominated by the program binary, for high 
'max by temporary arrays. 



|6]l may begin to dominate execution time. Fig.[8]shows the total 
wall-clock time for SHT pairs over a very wide range of band 
limits; to minimise the impact of communication, the degree of 
parallelisation was kept as low as possible for the runs in ques- 
tion. As expected, the Z^ ax scaling is a very good model for 
the execution times at / max > 511. Below this limit, the FFTs, 
precomputations for the spherical harmonic recursion, memory 
copy operations and other parts of the code begin to dominate. 

In analogy to one of the tests described in iReineckd (120111) . 
we computed a lower limit for the number of executed floating- 
point operations per second in libsharp's SHTs and compared 
the result with the theoretical peak performance achievable on 
the given hardware, which is eight operations per clock cycle 
(four additions and four multiplications) or 21.6GFlops/s per 
core. Fig. [9] shows the results. In contrast to libpsht, which 
reached approximately 22% for 5 = and 43% for s — 2, both 
scalar and tensor harmonic transforms exhibit very similar per- 
formance levels and almost reach 70% of theoretical peak in the 
most favourable regime, thanks to the changed structure of the 
inner loops. For the / max range that is typically required in cos- 
mological applications, performance exceeds 50% (even when 
MPI is used), which is very high for a practically useful algo- 
rithm on this kind of computer architecture. 



5.2.4. Multiple simultaneous SHTs 

The computation of the s Ai,„{§) coefficients accounts for roughly 
half the arithmetic operations in an SHT. If several SHTs with 
identical grid geometry and band limit are computed simulta- 
neously, it is possible to re-use these coefficients, thereby re- 
ducing the overall operation count. Fig.[10]shows the speed-ups 
compared to sequential execution for various scenarios, which 
increase with the number of transforms and reach saturation 
around a factor of 1 .6. This value is lower than the naively ex- 
pected asymptotic factor of 2 (corresponding to avoiding half of 
the arithmetic operations), since the changed algorithm requires 
more memory transfers between Level- 1 cache and CPU regis- 
ters and therefore operates at a lower percentage of theoretical 
peak performance. Nevertheless, running SHTs simultaneously 
is evidently beneficial and should be used whenever possible. 

5.2.5. Memory overhead 

Especially at high band limits, it is important that the SHT li- 
brary does not consume a large amount of main memory, to 
avoid memory exhaustion and subsequent swapping or code 
crashes. Libsharp is designed with the goal to keep the size of 
its auxiliary data structures much lower than the combined size 
of any SHT's input and output arrays. A measurement is shown 
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Fig. 12. Performance comparison between libsharp and shtns 
for varying / max and number of OpenMP threads. Note that re- 
duced autotuning was used for shtns at / max =16383 (see text). 



in Fig.[TT] Below the lowest shown band limit of 2047, memory 
overhead quickly climbs to almost 100%, since in this regime 
memory consumption is dominated by the executable and the 
constant overhead of the communication libraries, which on the 
testing machine amounts to approximately 50MB. In the impor- 
tant range (/ max > 2047), memory overhead lies below 45%. 

5.2.6. Comparison with existing implementations 

Table [TJ shows a performance comparison of synthesis/analysis 
SHT pairs between libsharp and various other SHT implemen- 
tations. In addition to the already mentioned shtns, Wavemoth, 
S 2 HAT and libpsht codes, w e also in cluded spinsfast 
(iHuffenbereer & Wandeltl l2010h. SS HT dMcEwen & Wiauxl 
l20l¥ and Glesp ( Doroshk evich et al.ll2005l) in the comparison. 
All computations shared a common band limit of 2047 and were 
executed on a single core, since the corresponding SHTs are sup- 
ported by all libraries and are very likely carried out with a com- 
paratively high efficiency by all of them. The large overall num- 
ber of possible parameters (7 max , spin, number of simultaneous 
transforms, degree and kind of parallelisation, choice of grid, 
etc.) prevented a truly comprehensive study. 

Overall, libsharp's performance is very satisfactory and 
exhibits speed-ups of more than an order of magnitude in sev- 
eral cases. The table also demonstrates libsharp's flexibility, 
since it supports all of the other codes' "native" grid geometries, 
which is required for direct comparisons. 

The three last columns list time ratios measured under dif- 
ferent assumptions: Ravx reflects values that can be expected 
on modern (2012 and later) AMD/Intel CPUs supporting AVX, 
#sse2 applies to older (2001 and later) CPUs with the SSE2 in- 
struction set. /? sca i al should be used for CPUs from other vendors 
like IBM or ARM, since libsharp does not yet support vectori- 
sation for these architectures. 

Fig. [12] shows the relative performance of identical SHT 
pairs on a full Gauss-Legendre grid with s=0 for libsharp and 
shtns. For these measurements the benchmarking code deliv- 
ered with shtns was adjusted to measure SHT times in a similar 
fashion as was described above. The plotted quantity is shtns 
wall-clock time divided by libsharp wall-clock time for vary- 
ing Z max and number of OpenMP threads. It is evident that shtns 
has a significant advantage for small band limits (almost an or- 
der of magnitude) and maintains a slight edge up to / max =8191. 
It must be noted, however, that the measured times do not in- 



clude the overhead for auto-tuning and necessary precalcula- 
tions, which in the case of shtns are about an order of mag- 
nitude more expensive than the SHTs themselves. As a conse- 
quence, its performance advantage only pays off if many identi- 
cal SHT operations are performed within one run. The origin of 
shtns's performance advantage has not been studied in depth; 
however, a quick analysis shows that the measured time differ- 
ences scale roughly like Z^ ax , so the following explanations are 
likely candidates: 

- libsharp performs all of its precomputations as part of the 
time-measured SHT 

- libsharp's flexibility with regard to pixelisation and stor- 
age arrangement of input and output data requires some ad- 
ditional copy operations 

- at low band limits the inferior performance of libsharp's 
FFT implementation has a noticeable impact on overall run 
times. 

The relative performance of both libraries is remarkably in- 
sensitive to the number of OpenMP threads; this indicates that 
the performance differences are located in parallel code regions 
as opposed to sequential ones. 

For Z max = 16383, the time required by the default shtns 
autotuner becomes very long (on the order of wall-clock hours), 
so that we decided to invoke it with an option for reduced tuning. 
It is likely a consequence of this missed optimisation that, at this 
band limit, libsharp is the better-performing code. 

6. Conclusions 

Judging from the benchmarks presented in the preceding sec- 
tion, the goals that were set for the libsharp library have been 
reached: it exceeds libpsht in terms of performance, supports 
recent developments in microprocessor technology, allows using 
distributed memory systems for a wider range of applications, 
and is slightly easier to use. On the developer side, the modular 
design of the code makes it much more straightforward to add 
support for new instruction sets and other functionality. 

In some specific scenarios, especially for SHTs with com- 
paratively low band limits, libsharp does not provide the best 
performance of all available implementations, but given its ex- 
treme flexibility concerning grid types and the memory layout 
of its input/output data, as well as its compactness (« 8000 lines 
of portable and easily maintainable source code without external 
dependencies), this compromise certainly seems acceptable. 

The library has been successfully integrated into version 3.1 
of the HEALPix C++ and Fortran packages. There also exists 
an experimental version of the SSHlQ package with libsharp 
replacing the library's original SHT engine. Libsharp is also 
used as SHT engine in an upcomin g version of the Py thon pack- 
age NIFT\Q for signal inference (Se lig et al.ll2013l). Recently , 
the total convolution code conviqt dPrezeau & Reine cke 2010), 
which is a central co mponent of the Planck simulation pipeline 
dReinecke et al.l 120061) . has been updated and is now based 
on libsharp SHTs. There are plans f or a similar update of 
the a rtDeco deconvolution map maker (Keiha nen & Reineckd 
|20H . 

A potential future field of work is porting libsharp to 
Intel's "many integrated cores" architecture^, once sufficient 
compiler support for this platform has been established. The 

3 http : //www . mrao . cam . ac . uk/ ~ j dm5 7/ ssht/index . html 

4 http : //www . mpa- gar ching . mpg . de/ i f t/ni f ty/ 

5 http : //en . wikipedia . org/wiki/Intel_MIC 
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Table 1. Performance comparison with other implementations at / max = 2047, n core = 1. 



Code 


version 


grid 


spin 


"SHT 


^AVX 


^SSE2 


^scalar 


shtns 


2.31 


Gauss-Legendre 





1 


0.84 


0.88 


0.91 


Wavemoth (brute-force) 


Nov 20 11 


HEALPix (A? sidc = 1024) 





1 


1.63 


0.98 









" 





5 


1.59 


1.09 





Wavemoth (butterfly) 


Nov 2011 


HEALPix (Ar sidc = 1024) 





5 


0.96 


0.66 





libpsht 


Jan 20 11 


Gauss-Legendre 





1 


4.06 


2.30 


2.30 


» 









5 


2.66 


1.75 


1.62 








2 


1 


2.50 


1.48 


1.20 


" 






2 


5 


2.15 


1.44 


1.08 


spinsfast 


rl04 


ECP (Clenshaw-Curtis) 





1 


57.04 


32.12 


15.31 











5 


28.39 


18.72 


9.38 








2 


1 


16.99 


10.20 


4.73 








2 


5 


8.60 


5.66 


2.56 


SSHT 


l.Obl 


MW sampling theorem 





1 


20.91 


15.60 


9.46 








2 


1 


13.40 


9.29 


4.99 


S 2 HAT 


2.55beta 


HEALPix (W slde =1024) 





1 


12.33 


7.33 


3.60 


Glesp 


2 


Gauss-Legendre 





1 


55.32 


31.26 


14.95 



Notes. All tests had a band limit of / ra;lx = 2047 and were carried out on a single core. The grids used by libsharp and the respective comparison 
code were identical in each run. ^avx denotes the quotient of wall-clock times for the respective code and libsharp in the presence of the AVX 
instruction set, i?ssE2 is the quotient when SSE2 (but not AVX) is supported, and R xll iai was measured with both SSE2 and AVX disabled. The 
libsharp support for the MW sampling theorem used for the SSHT comparisons is experimental. For Wavemoth, butterfly matrix compression 
can optionally be enabled. In the benchmark given we requested an accuracy of 10~ 4 , which led to an extra requirement of 4 GB of precomputed 
data in memory. Note that when running on a single core, Wavemoth is at an advantage compared to the normal situation where the memory bus 
is shared between multiple cores. 

hardware appears to be very well suited for running SHTs, and Wandel t, B. D. & Gors ki, K. M. 2001, Phys. Rev. D. 63. 123002, 
the porting by itself would provide a welcome test for the adapt- arXiv:astro-ph/0008227 
ability of the library's code design. 
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